124 research outputs found

    A global Approach to the Comparison of Clustering Results

    Copyright © 2012 Walter de Gruyter GmbH. The discovery of knowledge in the case of Hierarchical Cluster Analysis (HCA) depends on many factors, such as the clustering algorithms applied and the strategies developed in the initial stage of Cluster Analysis. We present a global approach for evaluating the quality of clustering results and for comparing different clustering algorithms using the relevant information available (e.g. the stability, isolation and homogeneity of the clusters). In addition, we present a visual method to facilitate evaluation of the quality of the partitions, allowing identification of the similarities and differences between partitions, as well as the behaviour of the elements in the partitions. We illustrate our approach using a complex and heterogeneous dataset (real horse data) taken from the literature. We apply HCA based on the generalized affinity coefficient (similarity coefficient) to the case of complex data (symbolic data), combined with 26 (classic and probabilistic) clustering algorithms. Finally, we discuss the obtained results and the contribution of this approach to gaining better knowledge of the structure of data.

    Distribution of the Affinity Coefficient between Variables based on the Monte Carlo Simulation Method

    This journal provides immediate open access to its content on the principle that making research freely available to the public supports a greater global exchange of knowledge. The affinity coefficient and its extensions have been used in both hierarchical and non-hierarchical Cluster Analysis. The purpose of the present empirical study on the distribution of the basic and the generalized affinity coefficients and on the distribution of the standardized affinity coefficient, by the method of Wald and Wolfowitz, under different assumptions, is to assess the effect of the statistical probability distributions of the variables (columns) of the initial data matrix, and of the respective parameters, on the distribution of the values of these coefficients. We present some results concerning the asymptotic distribution of these coefficients under the assumption that the variables (for which the values of these coefficients are calculated) are independent and have statistical probability distributions specified a priori. In this distributional study, based on the Monte Carlo simulation method, we considered ten well-known statistical probability distributions with different variations of the respective parameters. The simulation studies lead to the conclusion that the coefficients’ convergence to the normal distribution is quite fast and, in general, a good approximation is obtained for small sample sizes, that is, for sample sizes above 20 and in many cases for sample sizes above 10.

    Clustering of Symbolic Data based on Affinity Coefficient: Application to a Real Data Set

    Copyright © 2013 Walter de Gruyter GmbH. In this paper, we illustrate an application of Ascendant Hierarchical Cluster Analysis (AHCA) to complex data taken from the literature (interval data), based on the standardized weighted generalized affinity coefficient, by the method of Wald and Wolfowitz. The probabilistic aggregation criteria used belong to a parametric family of methods under the probabilistic approach of AHCA, named VL methodology. Finally, we compare the results achieved using our approach with those obtained by other authors.

    Quality evaluation of a selected partition : An approach based on resampling methods

    The aim of this work on cluster analysis is to provide a methodology to analyse and assess the quality of a selected partition (the best partition according to several validation indexes). In the proposed approach, the stability and consistency of the selected partition (original partition) were evaluated by comparing this partition with each of the partitions (with the same number of clusters as the original one) obtained by resampling. Special emphasis is given to an index defined by a linear combination of four indicators, which allows evaluating the adjustment between the original partition and each of the partitions (and/or the set of partitions) obtained from resampled data. The application of these indexes is exemplified using a set of real data, and the main conclusions are summarized and discussed. CICS.UAc/CICS.NOVA.UAc, UID/SOC/04647/2013; this paper was produced with support from the FCT/MEC through National Funds and, when applicable, co-financed by the FEDER within the partnership agreement PT2020.

    On clustering interval data with different scales of measures : experimental results

    This article is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Symbolic Data Analysis can be defined as the extension of standard data analysis to more complex data tables. We illustrate the application of the Ascendant Hierarchical Cluster Analysis (AHCA) to a symbolic data set (with a known structure) in the field of the automobile industry (car data set), in which objects are described by variables whose values are intervals of the real data set (interval variables). The AHCA of thirty-three car models, described by eight interval variables (with different scales of measure), was based on the standardized weighted generalized affinity coefficient, by the method of Wald and Wolfowitz. We applied three probabilistic aggregation criteria in the scope of the VL methodology (V for Validity, L for Linkage). Moreover, we compare the achieved results with those obtained by other authors, and with an a priori partition into four clusters defined by the category (Utilitarian, Berlina, Sporting and Luxury) to which each car belongs. We used the global statistics of levels (STAT) to evaluate the obtained partitions.

    Clustering an interval data set : are the main partitions similar to a priori partition?

    This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. In this paper we compare the best partitions of data units (cities) obtained from different algorithms of Ascendant Hierarchical Cluster Analysis (AHCA) of a well-known data set from the literature on symbolic data analysis (“city temperature interval data set”) with an a priori partition of cities given by a panel of human observers. The AHCA was based on the weighted generalised affinity coefficient with equal weights, and on the probabilistic coefficient associated with the asymptotic standardized weighted generalized affinity coefficient by the method of Wald and Wolfowitz. These similarity coefficients between elements were combined with three aggregation criteria: one classical, Single Linkage (SL), and two probabilistic, AV1 and AVB, the latter two in the scope of the VL methodology. The evaluation of the partitions, in order to find the partitioning that best fits the underlying data, was carried out using some validation measures based on the similarity matrices. In general, globally satisfactory results were obtained using our methods, with the best partitions being quite close to (or even coinciding with) the a priori partition provided by the panel of human observers.

    Tide-induced variations in the bacterial community, and in the physical and chemical properties of the water column of the Mondego estuary

    The bacterioplankton is a key component of the structure and function of aquatic ecosystems. Yet, present understanding of the controls on microbial abundance and activity only highlights their complexity. In estuaries, the problem is further complicated by the high variability of environmental properties (salinity, temperature, pH, organic loading and other factors). The present study investigates the dynamics of three main metabolic groups of planktonic bacteria involved in the cycling of organic matter (aerobic heterotrophic bacteria, sulphate-reducing bacteria, and nitrate-reducing bacteria), over one tidal cycle in the estuary of the Mondego. The association of various physical, chemical and biological parameters with the composition of the bacterial community was assessed by multivariate analysis in order to identify key factors controlling the composition and tidal dynamics of the bacterial communities in the Mondego estuary. Principal component analysis (PCA) identified the sources of variability for the bacterial communities in the estuary as being, on the one hand, the different dynamics in the two stations under study (Foz and Pranto) and, on the other hand, the flood and ebb tide fluxes, through their effects on the environmental parameters. PRAXIS/P/MGS/11238/1998 - Impacto humano sobre a dinâmica estuarina de matéria e energia – bases para a gestão integrada de ecossistemas estuarinos - PROGRAMA PRAXIS XXI/98.

    Incident type 2 diabetes attributable to suboptimal diet in 184 countries

    The global burden of diet-attributable type 2 diabetes (T2D) is not well established. This risk assessment model estimated T2D incidence among adults attributable to direct and body weight-mediated effects of 11 dietary factors in 184 countries in 1990 and 2018. In 2018, suboptimal intake of these dietary factors was estimated to be attributable to 14.1 million (95% uncertainty interval (UI), 13.8–14.4 million) incident T2D cases, representing 70.3% (68.8–71.8%) of new cases globally. Largest T2D burdens were attributable to insufficient whole-grain intake (26.1% (25.0–27.1%)), excess refined rice and wheat intake (24.6% (22.3–27.2%)) and excess processed meat intake (20.3% (18.3–23.5%)). Across regions, highest proportional burdens were in central and eastern Europe and central Asia (85.6% (83.4–87.7%)) and Latin America and the Caribbean (81.8% (80.1–83.4%)); and lowest proportional burdens were in South Asia (55.4% (52.1–60.7%)). Proportions of diet-attributable T2D were generally larger in men than in women and were inversely correlated with age. Diet-attributable T2D was generally larger among urban versus rural residents and higher versus lower educated individuals, except in high-income countries, central and eastern Europe and central Asia, where burdens were larger in rural residents and in lower educated individuals. Compared with 1990, global diet-attributable T2D increased by 2.6 absolute percentage points (8.6 million more cases) in 2018, with variation in these trends by world region and dietary factor. These findings inform nutritional priorities and clinical and public health planning to improve dietary quality and reduce T2D globally. (c) 2023, The Author(s).